1 Introduction

A password, sometimes called a passcode, is a memorized secret used to confirm a user’s identity. Despite growing awareness of the need for strong passwords to keep attackers away from sensitive information, many bad passwords are still in use worldwide, and several lists of them have been compiled. This report looks at one such list.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
passwords <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-14/passwords.csv')
## Parsed with column specification:
## cols(
##   rank = col_double(),
##   password = col_character(),
##   category = col_character(),
##   value = col_double(),
##   time_unit = col_character(),
##   offline_crack_sec = col_double(),
##   rank_alt = col_double(),
##   strength = col_double(),
##   font_size = col_double()
## )
passwords
## # A tibble: 507 x 9
##     rank password category value time_unit offline_crack_s… rank_alt strength
##    <dbl> <chr>    <chr>    <dbl> <chr>                <dbl>    <dbl>    <dbl>
##  1     1 password passwor…  6.91 years          2.17               1        8
##  2     2 123456   simple-… 18.5  minutes        0.0000111          2        4
##  3     3 12345678 simple-…  1.29 days           0.00111            3        4
##  4     4 1234     simple-… 11.1  seconds        0.000000111        4        4
##  5     5 qwerty   simple-…  3.72 days           0.00321            5        8
##  6     6 12345    simple-…  1.85 minutes        0.00000111         6        4
##  7     7 dragon   animal    3.72 days           0.00321            7        8
##  8     8 baseball sport     6.91 years          2.17               8        4
##  9     9 football sport     6.91 years          2.17               9        7
## 10    10 letmein  passwor…  3.19 months         0.0835            10        8
## # … with 497 more rows, and 1 more variable: font_size <dbl>

Take note of the definition of each column:

rank: Popularity in their database of released passwords
password: Actual text of the password
category: What category does the password fall into?
value: Time to crack by online guessing
time_unit: Time unit to match with value
offline_crack_sec: Time to crack offline in seconds
rank_alt: Rank 2
strength: Quality of the password, where 10 is highest and 1 is lowest; note that these are relative to these generally bad passwords
font_size: Used to create the graphic for KIB

2 Cleansing data

str(passwords)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 507 obs. of  9 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password" "123456" "12345678" "1234" ...
##  $ category         : chr  "password-related" "simple-alphanumeric" "simple-alphanumeric" "simple-alphanumeric" ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : chr  "years" "minutes" "days" "seconds" ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : num  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : num  11 8 8 8 11 8 11 8 11 11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   rank = col_double(),
##   ..   password = col_character(),
##   ..   category = col_character(),
##   ..   value = col_double(),
##   ..   time_unit = col_character(),
##   ..   offline_crack_sec = col_double(),
##   ..   rank_alt = col_double(),
##   ..   strength = col_double(),
##   ..   font_size = col_double()
##   .. )
passwords$category <- as.factor(passwords$category)
class(passwords$category)
## [1] "factor"
passwords$time_unit <- as.factor(passwords$time_unit)
class(passwords$time_unit)
## [1] "factor"
passwords %>% 
  is.na() %>% 
  colSums()
##              rank          password          category             value 
##                 7                 7                 7                 7 
##         time_unit offline_crack_sec          rank_alt          strength 
##                 7                 7                 7                 7 
##         font_size 
##                 7

As a rule of thumb, rows with missing values can be deleted when they make up less than 5% of the data. Here every column reports 7 missing values out of 507 rows (about 1.4%), so we drop those rows.
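The 5% check can be made explicit. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration, not the real `passwords` table) and computes the share of incomplete rows with base R's `complete.cases()`. The same expression applied to `passwords` would give the figure for the real data.

```r
# Toy stand-in for `passwords` (made up for illustration):
# 100 rows, of which the last 2 are entirely NA.
toy <- data.frame(
  rank     = c(1:98, NA, NA),
  password = c(rep("x", 98), NA, NA),
  stringsAsFactors = FALSE
)

# Share of rows with at least one missing value
incomplete_share <- mean(!complete.cases(toy))
incomplete_share

# Apply the 5% rule of thumb
incomplete_share < 0.05
```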

# drop_na() with no column arguments drops rows that have an NA in any column
passwords_new <- passwords %>% 
  drop_na()

Check the number of NA values once again:

passwords_new %>% 
  is.na() %>% 
  colSums()
##              rank          password          category             value 
##                 0                 0                 0                 0 
##         time_unit offline_crack_sec          rank_alt          strength 
##                 0                 0                 0                 0 
##         font_size 
##                 0

3 Initial insight on the data

str(passwords_new)
## Classes 'tbl_df', 'tbl' and 'data.frame':    500 obs. of  9 variables:
##  $ rank             : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ password         : chr  "password" "123456" "12345678" "1234" ...
##  $ category         : Factor w/ 10 levels "animal","cool-macho",..: 7 9 9 9 9 9 1 10 10 7 ...
##  $ value            : num  6.91 18.52 1.29 11.11 3.72 ...
##  $ time_unit        : Factor w/ 7 levels "days","hours",..: 7 3 1 5 1 3 1 7 7 4 ...
##  $ offline_crack_sec: num  2.17 1.11e-05 1.11e-03 1.11e-07 3.21e-03 1.11e-06 3.21e-03 2.17 2.17 8.35e-02 ...
##  $ rank_alt         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ strength         : num  8 4 4 4 8 4 8 4 7 8 ...
##  $ font_size        : num  11 8 8 8 11 8 11 8 11 11 ...
summary(passwords_new)
##       rank         password                        category       value       
##  Min.   :  1.0   Length:500         name               :183   Min.   : 1.290  
##  1st Qu.:125.8   Class :character   cool-macho         : 79   1st Qu.: 3.430  
##  Median :250.5   Mode  :character   simple-alphanumeric: 61   Median : 3.720  
##  Mean   :250.5                      fluffy             : 44   Mean   : 5.603  
##  3rd Qu.:375.2                      sport              : 37   3rd Qu.: 3.720  
##  Max.   :500.0                      nerdy-pop          : 30   Max.   :92.270  
##                                     (Other)            : 66                   
##    time_unit   offline_crack_sec     rank_alt        strength     
##  days   :238   Min.   : 0.00000   Min.   :  1.0   Min.   : 0.000  
##  hours  : 43   1st Qu.: 0.00321   1st Qu.:125.8   1st Qu.: 6.000  
##  minutes: 51   Median : 0.00321   Median :251.5   Median : 7.000  
##  months : 87   Mean   : 0.50001   Mean   :251.2   Mean   : 7.432  
##  seconds: 11   3rd Qu.: 0.08350   3rd Qu.:376.2   3rd Qu.: 8.000  
##  weeks  :  5   Max.   :29.27000   Max.   :502.0   Max.   :48.000  
##  years  : 65                                                      
##    font_size   
##  Min.   : 0.0  
##  1st Qu.:10.0  
##  Median :11.0  
##  Mean   :10.3  
##  3rd Qu.:11.0  
##  Max.   :28.0  
## 

From the data above, we can conclude a few things:

1. There are 10 categories of bad passwords.
2. name is the most frequently used category (183 of 500 passwords).
3. The mean strength of these bad passwords is 7.432 and the median is 7. The scale is described as running from 1 (lowest) to 10 (highest), relative to this list of bad passwords, although the column itself ranges from 0 to 48.
4. The mean of value, the time to crack by online guessing, is 5.603 and the median is 3.720. These figures mix time units, so they cannot be read directly as a number of days, even though days is the most common unit (238 of 500 passwords).
5. The mean time to crack these passwords offline is 0.5 seconds, while the median is 0.00321 seconds.
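Because value is recorded in whatever unit time_unit names, its summary statistics are easier to interpret after converting everything to a single unit. The following is a minimal sketch of such a conversion on four rows copied from the printed table. The helper `to_seconds()` and the lookup `unit_seconds` are hypothetical names, and the lengths used for a month (30 days) and a year (365 days) are assumptions, since the data set does not state them.

```r
# Approximate number of seconds per unit.
# Assumption: 1 month = 30 days, 1 year = 365 days.
unit_seconds <- c(
  seconds = 1,
  minutes = 60,
  hours   = 60 * 60,
  days    = 24 * 60 * 60,
  weeks   = 7 * 24 * 60 * 60,
  months  = 30 * 24 * 60 * 60,
  years   = 365 * 24 * 60 * 60
)

# Hypothetical helper: convert a value in the given unit to seconds.
# Units missing from the lookup give NA.
to_seconds <- function(value, time_unit) {
  value * unname(unit_seconds[as.character(time_unit)])
}

# Four rows copied from the table printed in the introduction
sample_rows <- data.frame(
  password  = c("password", "123456", "12345678", "1234"),
  value     = c(6.91, 18.5, 1.29, 11.1),
  time_unit = c("years", "minutes", "days", "seconds"),
  stringsAsFactors = FALSE
)

sample_rows$online_crack_sec <- to_seconds(sample_rows$value, sample_rows$time_unit)
sample_rows
```

With a column like this, a mean or median would describe one quantity instead of a mixture of days, minutes and years.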

4 Plots

4.1 Distribution plot

Next, we examine the relationship between the strength of these passwords and the time needed to crack them offline. The strength score is taken here to reflect how long a password withstands online guessing rather than offline cracking, so the two measures need not agree.

library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot1 <- ggplot(data = passwords_new, mapping = aes(x = strength, y = offline_crack_sec)) +
  geom_jitter(aes(color = category)) +
  geom_smooth(method = "auto") +
  labs(x = "Strength", y = "Time to Crack Offline in Seconds", title = "Time to crack offline in seconds vs Strength of Passwords based on Online Guessing") +
  theme_minimal() +
  theme(legend.position = "none")

ggplotly(plot1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The plot above suggests a weak positive correlation between the time to crack these passwords offline and their strength based on online guessing. Note, however, that there are extreme outliers: passwords that take very little time to crack offline yet are rated as strong.
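A correlation read off a plot can be backed by a number. The sketch below computes a Spearman rank correlation with base R's `cor()`, using only the first ten rows of the table printed in the introduction, so the result is illustrative rather than the figure for the whole data set. On the real data the same call would take `passwords_new$strength` and `passwords_new$offline_crack_sec`.

```r
# First ten rows of the printed table (not the full data set)
strength          <- c(8, 4, 4, 4, 8, 4, 8, 4, 7, 8)
offline_crack_sec <- c(2.17, 1.11e-05, 1.11e-03, 1.11e-07, 3.21e-03,
                       1.11e-06, 3.21e-03, 2.17, 2.17, 8.35e-02)

# Spearman correlation works on ranks, so it is less sensitive to
# extreme outliers than the default Pearson correlation
rho <- cor(strength, offline_crack_sec, method = "spearman")
rho
```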

4.2 Box plot

Next, we look at the strength of these passwords by category to see which types of password are more easily guessed by computers.

plot2 <- ggplot(data = passwords_new, mapping = aes(x = category, y = strength)) +
  geom_boxplot(aes(fill = category)) +
  labs(x = "Category of passwords", y = "Strength", title = "Strength of Passwords based on their Categories") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

ggplotly(plot2)

Based on this box plot alone, we can tell that passwords that are classified as sport, nerdy pop, name and cool macho have the highest median strength of 8 while simple alphanumeric passwords have the lowest median strength of 4.

Simple alphanumeric and nerdy pop passwords have the most outliers (5 each); however, the nerdy pop outliers tend to be stronger than the simple alphanumeric ones.
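Outlier counts can also be computed rather than read off the plot. The sketch below assumes the usual box plot convention, in which a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. `count_outliers()` is a hypothetical helper and the data are made up, so the counts here say nothing about the real categories; counts taken from the interactive plot may also be computed slightly differently.

```r
# Hypothetical helper: count points outside the whiskers of a box plot,
# using the quartiles and the 1.5 * IQR rule
count_outliers <- function(x, coef = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - coef * iqr | x > q[2] + coef * iqr)
}

# Made-up strengths for two made-up categories (not taken from the data)
toy <- data.frame(
  category = rep(c("toy-a", "toy-b"), each = 6),
  strength = c(4, 5, 6, 7, 8, 40,   6, 7, 7, 8, 8, 9),
  stringsAsFactors = FALSE
)

# Apply the helper within each category
outliers_per_category <- tapply(toy$strength, toy$category, count_outliers)
outliers_per_category
```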

Next, let’s compare the rank of these passwords by category.

plot3 <- ggplot(data = passwords_new, mapping = aes(x = category, y = rank)) +
  geom_boxplot(aes(fill = category)) +
  labs(x = "Category of passwords", y = "Rank", title = "Popularity of Passwords based on their Categories") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

ggplotly(plot3)

According to the plot above, the category with the lowest median rank (146) is password-related passwords (which means it is the most popular), whereas the category with the highest median rank (295) is cool macho passwords (least popular).
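The medians quoted from the box plot can be tabulated directly. As a minimal sketch, the code below computes the median rank per category with base R's `aggregate()` on the first ten rows of the printed table only, so its numbers differ from the full-data medians quoted above. Running the same call on `passwords_new` would give the values shown in the plot.

```r
# First ten rows of the printed table (rank and category only)
head_rows <- data.frame(
  rank     = 1:10,
  category = c("password-related", "simple-alphanumeric", "simple-alphanumeric",
               "simple-alphanumeric", "simple-alphanumeric", "simple-alphanumeric",
               "animal", "sport", "sport", "password-related"),
  stringsAsFactors = FALSE
)

# Median rank within each category
median_rank <- aggregate(rank ~ category, data = head_rows, FUN = median)

# Sort so that the most popular category (lowest median rank) comes first
median_rank[order(median_rank$rank), ]
```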

Let’s combine the conclusions of these two graphs to gain a more meaningful insight.

Taken together, the two box plots show that although cool macho passwords are among the strongest categories by median strength, they are also the least popular. Similarly, name and nerdy pop passwords do not fare well in popularity despite being among the most resistant to cracking by computer.

Next, let’s compare the categories by the time needed to crack them offline.

plot4 <- ggplot(data = passwords_new, mapping = aes(x = category, y = offline_crack_sec)) +
  geom_boxplot(aes(fill = category)) +
  # Explicit breaks every 5 s; offline crack times run up to about 29 s
  scale_y_continuous(breaks = seq(0, 30, by = 5)) +
  labs(x = "Category of passwords", y = "Time to crack offline in seconds", title = "Time to crack these passwords offline based on their Categories") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

ggplotly(plot4)

From this plot, the main finding is that nerdy-pop passwords have the highest median time to crack offline, at 0.04 s.

This suggests that nerdy-pop passwords are the hardest to crack, both offline and online.

This conclusion is particularly useful to users when deciding which type of passwords to use in order to maximise their safety against hackers.

Next, we want to find the password that is widely used yet weakest against both offline and online cracking, so that we know which password to avoid at all costs.

5 Worst Password Among The Worst?

passrank <- passwords_new %>% 
  filter(strength == 0) %>% 
  filter(rank < 100)
passrank
## # A tibble: 5 x 9
##    rank password category value time_unit offline_crack_s… rank_alt strength
##   <dbl> <chr>    <fct>    <dbl> <fct>                <dbl>    <dbl>    <dbl>
## 1    19 111111   simple-… 18.5  minutes        0.0000111         19        0
## 2    20 2000     simple-… 11.1  seconds        0.000000111       20        0
## 3    46 pepper   food      3.72 days           0.00321           46        0
## 4    60 666666   simple-… 18.5  minutes        0.0000111         60        0
## 5    77 1111     simple-… 11.1  seconds        0.000000111       77        0
## # … with 1 more variable: font_size <dbl>
library(ggrepel)
plot5 <- ggplot(data = passwords_new, mapping = aes(x = password, y = rank)) +
  geom_jitter(aes(colour = strength)) +
  geom_label_repel(data = passrank, aes(label = password), size = 2) +
  facet_wrap(~strength) +
  labs(x = NULL, y = "Rank", title = "Passwords based on rank and strength") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_blank())

plot5

Based on this, we highlight the passwords 2000, 111111, pepper, 1111 and 666666, which have a rank below 100 and a strength of 0. Now, let’s see whether these passwords stand out again when we look at the time to crack them offline.

passtime <- passwords_new %>% 
  filter(offline_crack_sec < median(offline_crack_sec)) %>% 
  filter(rank < 100)
passtime
## # A tibble: 19 x 9
##     rank password category value time_unit offline_crack_s… rank_alt strength
##    <dbl> <chr>    <fct>    <dbl> <fct>                <dbl>    <dbl>    <dbl>
##  1     2 123456   simple-… 18.5  minutes        0.0000111          2        4
##  2     3 12345678 simple-…  1.29 days           0.00111            3        4
##  3     4 1234     simple-… 11.1  seconds        0.000000111        4        4
##  4     6 12345    simple-…  1.85 minutes        0.00000111         6        4
##  5    12 696969   simple-… 18.5  minutes        0.0000111         12        1
##  6    19 111111   simple-… 18.5  minutes        0.0000111         19        0
##  7    20 2000     simple-… 11.1  seconds        0.000000111       20        0
##  8    24 1234567  simple-…  3.09 hours          0.000111          24        4
##  9    34 test     passwor…  7.92 minutes        0.00000475        34        4
## 10    35 pass     passwor…  7.92 minutes        0.00000475        35        3
## 11    42 love     fluffy    7.92 minutes        0.00000475        42        6
## 12    45 6969     simple-… 11.1  seconds        0.000000111       45        4
## 13    50 654321   simple-… 18.5  minutes        0.0000111         50        4
## 14    58 123123   simple-… 18.5  minutes        0.0000111         58        7
## 15    60 666666   simple-… 18.5  minutes        0.0000111         60        0
## 16    61 hello    simple-…  3.43 hours          0.000124          61        4
## 17    67 sexy     cool-ma…  7.92 minutes        0.00000475        67        6
## 18    77 1111     simple-… 11.1  seconds        0.000000111       77        0
## 19    80 121212   simple-… 18.5  minutes        0.0000111         80        1
## # … with 1 more variable: font_size <dbl>
plot_1 <- ggplot(data = passtime, mapping = aes(x = reorder(password,-offline_crack_sec), y = offline_crack_sec)) +
  geom_col(aes(fill = offline_crack_sec)) +
  scale_fill_viridis_c() +
  labs(y = "Time to crack offline in seconds", x = "Passwords", title = "Time to crack passwords offline in seconds") +
  coord_flip()
ggplotly(plot_1)

From this, the worst passwords are 6969, 2000, 1234 and 1111, each cracked offline in about 0.000000111 seconds. Combining the two results above, 2000 and 1111 are among the weakest passwords against both online and offline attacks, yet they are widely used. The crown goes to 2000, as it is far more popular (rank 20) than 1111 (rank 77).
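The step of combining the two results can be written down explicitly. The sketch below copies the passwords and offline crack times from the two printed tables, takes the passwords that appear in both with `intersect()`, and then keeps those with the shortest offline crack time. The variable names are chosen for illustration only.

```r
# Passwords with strength 0 and rank below 100 (first table)
weak_by_strength <- c("111111", "2000", "pepper", "666666", "1111")

# Offline crack times in seconds for the passwords in the second table
offline_crack_sec <- c(
  "123456" = 1.11e-05, "12345678" = 1.11e-03, "1234"   = 1.11e-07,
  "12345"  = 1.11e-06, "696969"   = 1.11e-05, "111111" = 1.11e-05,
  "2000"   = 1.11e-07, "1234567"  = 1.11e-04, "test"   = 4.75e-06,
  "pass"   = 4.75e-06, "love"     = 4.75e-06, "6969"   = 1.11e-07,
  "654321" = 1.11e-05, "123123"   = 1.11e-05, "666666" = 1.11e-05,
  "hello"  = 1.24e-04, "sexy"     = 4.75e-06, "1111"   = 1.11e-07,
  "121212" = 1.11e-05
)

# Passwords that are weak on both measures
weak_both <- intersect(weak_by_strength, names(offline_crack_sec))
weak_both

# Among those, the ones that are quickest to crack offline
times   <- offline_crack_sec[weak_both]
fastest <- weak_both[times == min(times)]
fastest
```

Here `weak_both` contains 111111, 2000, 666666 and 1111, and `fastest` narrows this to 2000 and 1111, the two passwords singled out above.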

There may be a logical explanation for this. Many people use a date that is significant in their lives (a birth date, a wedding date, etc.) as a password, and 2000 is a year that is hard to forget. What sets 2000 apart from other years, however, is its three zeroes, which may make it especially easy for both people and computers to guess, since 0 and 1 are thought to be among the digits people use most in passwords.

6 The Best Among The Worst?

passstrong <- passwords_new %>% 
  filter(strength > 1.3*median(strength)) %>% 
  filter(rank < 100)
passstrong
## # A tibble: 4 x 9
##    rank password category value time_unit offline_crack_s… rank_alt strength
##   <dbl> <chr>    <fct>    <dbl> <fct>                <dbl>    <dbl>    <dbl>
## 1    13 abc123   simple-…  3.7  weeks               0.0224       13       32
## 2    22 superman name      6.91 years               2.17         22       10
## 3    26 trustno1 simple-… 92.3  years              29.0          26       25
## 4    66 computer nerdy-p…  6.91 years               2.17         66       10
## # … with 1 more variable: font_size <dbl>
library(ggrepel)
plot7 <- ggplot(data = passwords_new, mapping = aes(x = password, y = rank)) +
  geom_jitter(aes(colour = strength)) +
  geom_label_repel(data = passstrong, aes(label = password), size = 2) +
  facet_wrap(~strength) +
  labs(x = NULL, y = "Rank", title = "Passwords based on rank and strength") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_blank())

plot7

Based on this, we highlight the few passwords that are ranked below 100 yet have a strength of more than 1.3 times the median strength of all the passwords.

Let’s see whether these passwords are also tough to crack offline.

passhigh <- passwords_new %>% 
  filter(offline_crack_sec > 100*median(offline_crack_sec)) %>% 
  filter(rank < 100)
passhigh
## # A tibble: 13 x 9
##     rank password category value time_unit offline_crack_s… rank_alt strength
##    <dbl> <chr>    <fct>    <dbl> <fct>                <dbl>    <dbl>    <dbl>
##  1     1 password passwor…  6.91 years                 2.17        1        8
##  2     8 baseball sport     6.91 years                 2.17        8        4
##  3     9 football sport     6.91 years                 2.17        9        7
##  4    18 jennifer name      6.91 years                 2.17       18        9
##  5    22 superman name      6.91 years                 2.17       22       10
##  6    26 trustno1 simple-… 92.3  years                29.0        26       25
##  7    41 michelle name      6.91 years                 2.17       41        8
##  8    43 sunshine fluffy    6.91 years                 2.17       43        9
##  9    53 starwars nerdy-p…  6.91 years                 2.17       53        8
## 10    66 computer nerdy-p…  6.91 years                 2.17       66       10
## 11    74 corvette cool-ma…  6.91 years                 2.17       74        8
## 12    83 princess fluffy    6.91 years                 2.17       83        8
## 13    99 iloveyou fluffy    6.91 years                 2.17       99        9
## # … with 1 more variable: font_size <dbl>
plot_2 <- ggplot(data = passhigh, mapping = aes(x = reorder(password, offline_crack_sec), y = offline_crack_sec)) +
  geom_col(aes(fill = offline_crack_sec)) +
  scale_fill_viridis_c() +
  labs(y = "Time to crack offline in seconds", x = "Passwords", title = "Time to crack passwords offline in seconds") +
  coord_flip()
ggplotly(plot_2)

From these plots, we find that trustno1 is popular (rank 26) yet the hardest of these passwords to crack, both online (92.3 years) and offline (29.0 seconds). It may be the hardest to crack because it combines letters and numbers. It may be popular because the phrase itself is easy to remember.

7 Last words

In this report, I have analysed various kinds of bad passwords along with their popularity, strength and category. I would caution against using any of these passwords, as they are easy to crack. It is therefore advisable to use high-strength passwords to ward off hackers and protect the crucial information and data stored in our devices and cards.

Lastly, I would like to end this with a comic regarding passwords that may be useful when selecting which password to use in the future.